Overview

Dataset Statistics

Number of Variables 27
Number of Rows 396030
Missing Cells 81589
Missing Cells (%) 0.8%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 421.4 MB
Average Row Size in Memory 1.1 KB
Variable Types
  • Numerical: 12
  • Categorical: 15
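
The percentage and per-row figures above follow from the raw counts. A minimal sketch of the arithmetic, assuming the report counts 1 MB as 1024 × 1024 bytes (the report does not say which convention it uses):

```python
# Counts copied from the Dataset Statistics block above.
n_rows = 396_030
n_vars = 27
missing_cells = 81_589
total_mb = 421.4

total_cells = n_rows * n_vars
missing_pct = 100 * missing_cells / total_cells

# Assumption: 1 MB = 1024 * 1024 bytes and 1 KB = 1024 bytes.
avg_row_kb = total_mb * 1024 * 1024 / n_rows / 1024

print(f"total cells: {total_cells}")
print(f"missing cells: {missing_pct:.1f}%")
print(f"average row size: {avg_row_kb:.1f} KB")
```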

Dataset Insights

emp_title has 22927 (5.79%) missing values
emp_length has 18301 (4.62%) missing values
mort_acc has 37795 (9.54%) missing values
annual_inc is skewed
dti is skewed
open_acc is skewed
pub_rec is skewed
revol_bal is skewed
revol_util is skewed
mort_acc is skewed
pub_rec_bankruptcies is skewed
emp_title has a high cardinality: 173105 distinct values
issue_d has a high cardinality: 115 distinct values
title has a high cardinality: 48817 distinct values
earliest_cr_line has a high cardinality: 684 distinct values
address has a high cardinality: 393700 distinct values
term has constant length 10
grade has constant length 1
sub_grade has constant length 2
issue_d has constant length 8
earliest_cr_line has constant length 8
initial_list_status has constant length 1
pub_rec has 338272 (85.42%) zeros
mort_acc has 139777 (35.29%) zeros
pub_rec_bankruptcies has 350380 (88.47%) zeros
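
Each percentage in the insights is the flagged count divided by the 396030 rows. A short sketch that reproduces them from the counts listed above:

```python
n_rows = 396_030

# (column, kind, count) triples copied from the insights above.
flagged = [
    ("emp_title", "missing", 22_927),
    ("emp_length", "missing", 18_301),
    ("mort_acc", "missing", 37_795),
    ("pub_rec", "zeros", 338_272),
    ("mort_acc", "zeros", 139_777),
    ("pub_rec_bankruptcies", "zeros", 350_380),
]

percentages = {
    (col, kind): round(100 * count / n_rows, 2)
    for col, kind, count in flagged
}

for (col, kind), pct in percentages.items():
    print(f"{col}: {pct}% {kind}")
```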

Variables

loan_amnt

numerical

Approximate Distinct Count 1397
Approximate Unique (%) 0.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 14113.8881
Minimum 500
Maximum 40000
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • loan_amnt is skewed right (γ1 = 0.7773)

Quantile Statistics

Minimum 500
5-th Percentile 3350
Q1 8000
Median 12000
Q3 20000
95-th Percentile 31300
Maximum 40000
Range 39500
IQR 12000

Descriptive Statistics

Mean 14113.8881
Standard Deviation 8357.4413
Variance 6.9847e+07
Sum 5.5895e+09
Skewness 0.7773
Kurtosis -0.06261
Coefficient of Variation 0.5921
  • loan_amnt is not normally distributed (p-value 0.0009417442947102507)
  • loan_amnt has 191 outliers
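
The outlier count is consistent with a box-plot style rule. A minimal sketch of the common 1.5 × IQR (Tukey) fences, using the quartiles above; the report does not state which rule it applies, so the 1.5 multiplier is an assumption:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; k=1.5 is the conventional multiplier (assumed)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the loan_amnt quantile statistics.
lower, upper = tukey_fences(8000, 20000)
print(lower, upper)  # -10000.0 38000.0
```

Under that assumption only loans above 38000 would count as outliers, since the lower fence falls below the minimum of 500.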

term

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 28.3 MB
  • The most frequent value ( 36 months) occurs over 3.21 times as often as the second most frequent value ( 60 months)

Length

Mean 10
Standard Deviation 0
Median 10
Minimum 10
Maximum 10

Sample

1st row 36 months
2nd row 36 months
3rd row 36 months
4th row 36 months
5th row 60 months

Letter

Count 2376180
Lowercase Letter 2376180
Space Separator 792060
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 792060
  • The top 2 categories ( 36 months, 60 months) take over 50.0%
  • term has words of constant length
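
The constant length of 10 and the two space separators per row (792060 / 396030 = 2) indicate that each value carries a leading space, as in " 36 months". A hedged sketch of turning the column into an integer number of months; parse_term is a hypothetical helper, not part of the report:

```python
def parse_term(raw: str) -> int:
    """Convert a value such as ' 36 months' to the integer 36."""
    number, unit = raw.strip().split()
    if unit != "months":
        raise ValueError(f"unexpected unit in {raw!r}")
    return int(number)

print(parse_term(" 36 months"))  # 36
print(len(" 36 months"))         # 10, matching the reported constant length
```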

int_rate

numerical

Approximate Distinct Count 566
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 13.6394
Minimum 5.32
Maximum 30.99
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • int_rate is skewed right (γ1 = 0.4207)

Quantile Statistics

Minimum 5.32
5-th Percentile 6.89
Q1 10.49
Median 13.33
Q3 16.55
95-th Percentile 21.98
Maximum 30.99
Range 25.67
IQR 6.06

Descriptive Statistics

Mean 13.6394
Standard Deviation 4.4722
Variance 20.0002
Sum 5.4016e+06
Skewness 0.4207
Kurtosis -0.144
Coefficient of Variation 0.3279
  • int_rate has 3148 outliers
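
Two of the descriptive statistics are derived from the others: the variance is the square of the standard deviation, and the coefficient of variation is the standard deviation divided by the mean. A quick check using the rounded values above (small differences in the last digits are rounding effects):

```python
mean = 13.6394
std = 4.4722

variance = std ** 2            # close to the reported 20.0002
coeff_of_variation = std / mean

print(round(variance, 2))            # 20.0
print(round(coeff_of_variation, 4))  # 0.3279
```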

installment

numerical

Approximate Distinct Count 55706
Approximate Unique (%) 14.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 431.8497
Minimum 16.08
Maximum 1533.81
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • installment is skewed right (γ1 = 0.9836)

Quantile Statistics

Minimum 16.08
5-th Percentile 111.4125
Q1 250.33
Median 375.85
Q3 569.95
95-th Percentile 929.87
Maximum 1533.81
Range 1517.73
IQR 319.62

Descriptive Statistics

Mean 431.8497
Standard Deviation 250.7278
Variance 62864.4244
Sum 1.7103e+08
Skewness 0.9836
Kurtosis 0.7838
Coefficient of Variation 0.5806
  • installment has 10734 outliers

grade

categorical

Approximate Distinct Count 7
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 24.9 MB

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row B
2nd row B
3rd row B
4th row A
5th row C

Letter

Count 396030
Lowercase Letter 0
Space Separator 0
Uppercase Letter 396030
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (B, C) take over 50.0%
  • grade has words of constant length

sub_grade

categorical

Approximate Distinct Count 35
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 25.3 MB

Length

Mean 2
Standard Deviation 0
Median 2
Minimum 2
Maximum 2

Sample

1st row B4
2nd row B5
3rd row B3
4th row A2
5th row C5

Letter

Count 396030
Lowercase Letter 0
Space Separator 0
Uppercase Letter 396030
Dash Punctuation 0
Decimal Number 396030
  • sub_grade has words of constant length
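
Seven grades and 35 sub-grades fit a scheme of one letter followed by one digit, which makes the column naturally ordinal. A sketch of an ordinal encoding, assuming the grades run A to G and the digits 1 to 5 (consistent with the counts and samples above, but not stated in the report):

```python
GRADES = "ABCDEFG"  # assumed grade alphabet: 7 letters

def sub_grade_rank(sub_grade: str) -> int:
    """Map 'A1' -> 0, 'A2' -> 1, ..., 'G5' -> 34."""
    letter, digit = sub_grade[0], int(sub_grade[1])
    if letter not in GRADES or not 1 <= digit <= 5:
        raise ValueError(f"unexpected sub_grade {sub_grade!r}")
    return GRADES.index(letter) * 5 + (digit - 1)

print([sub_grade_rank(s) for s in ["B4", "B5", "B3", "A2", "C5"]])
# [8, 9, 7, 1, 14]
```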

emp_title

categorical

Approximate Distinct Count 173105
Approximate Unique (%) 46.4%
Missing 22927
Missing (%) 5.8%
Memory Size 29.0 MB

Length

Mean 16.5867
Standard Deviation 8.0121
Median 16
Minimum 1
Maximum 78

Sample

1st row Marketing
2nd row Credit analyst
3rd row Statistician
4th row Client Advocate
5th row Destiny Management...

Letter

Count 5641197
Lowercase Letter 4701326
Space Separator 487836
Uppercase Letter 939871
Dash Punctuation 5541
Decimal Number 6383
  • emp_title contains many words: 53640 words
  • The most frequent word (manager) occurs over 3.75 times as often as the second most frequent word (inc)

emp_length

categorical

Approximate Distinct Count 11
Approximate Unique (%) 0.0%
Missing 18301
Missing (%) 4.6%
Memory Size 26.2 MB
  • The most frequent value (10+ years) occurs over 3.52 times as often as the second most frequent value (2 years)

Length

Mean 7.6828
Standard Deviation 1.0104
Median 7
Minimum 6
Maximum 9

Sample

1st row 10+ years
2nd row 4 years
3rd row < 1 year
4th row 6 years
5th row 9 years

Letter

Count 1831038
Lowercase Letter 1831038
Space Separator 409454
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 503770
  • The most frequent word (years) occurs over 2.54 times as often as the second most frequent word (10+)
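
To use emp_length numerically, the 11 text categories need to be mapped to numbers. One possible mapping is sketched below; treating "< 1 year" as 0 and "10+ years" as 10 is a modelling assumption, and parse_emp_length is a hypothetical helper:

```python
import re
from typing import Optional

def parse_emp_length(raw: Optional[str]) -> Optional[int]:
    """Map '< 1 year' -> 0, '4 years' -> 4, '10+ years' -> 10, missing -> None."""
    if raw is None:
        return None
    text = raw.strip()
    if text.startswith("<"):
        return 0
    match = re.match(r"(\d+)", text)
    if match is None:
        raise ValueError(f"unexpected emp_length {raw!r}")
    return int(match.group(1))

samples = ["10+ years", "4 years", "< 1 year", "6 years", "9 years", None]
print([parse_emp_length(s) for s in samples])
# [10, 4, 0, 6, 9, None]
```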

home_ownership

categorical

Approximate Distinct Count 6
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 26.8 MB

Length

Mean 5.9083
Standard Deviation 2.1136
Median 8
Minimum 3
Maximum 8

Sample

1st row RENT
2nd row MORTGAGE
3rd row RENT
4th row RENT
5th row MORTGAGE

Letter

Count 2339875
Lowercase Letter 0
Space Separator 0
Uppercase Letter 2339875
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (MORTGAGE, RENT) take over 50.0%

annual_inc

numerical

Approximate Distinct Count 27197
Approximate Unique (%) 6.9%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 74203.1758
Minimum 0
Maximum 8.7066e+06
Zeros 1
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • annual_inc is skewed right (γ1 = 41.0426)

Quantile Statistics

Minimum 0
5-th Percentile 28000
Q1 45000
Median 64000
Q3 90000
95-th Percentile 150000
Maximum 8.7066e+06
Range 8.7066e+06
IQR 45000

Descriptive Statistics

Mean 74203.1758
Standard Deviation 61637.6212
Variance 3.7992e+09
Sum 2.9387e+10
Skewness 41.0426
Kurtosis 4238.497
Coefficient of Variation 0.8307
  • annual_inc is not normally distributed (p-value 4.4293503625143945e-25)
  • annual_inc has 16700 outliers
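
With a skewness of 41.0 and a maximum more than 100 times the median, annual_inc is a typical candidate for a log transform before modelling. A minimal sketch using log1p, which stays defined for the single zero in the column; whether to transform at all is a modelling choice the report does not make:

```python
import math

# Quantiles copied from the annual_inc statistics above
# (the maximum is the rounded value 8.7066e+06 shown in the report).
incomes = [0, 28_000, 45_000, 64_000, 90_000, 150_000, 8_706_600]

logged = [math.log1p(x) for x in incomes]

for raw, val in zip(incomes, logged):
    print(f"{raw:>9} -> {val:.2f}")
```

On the log scale the maximum is only about 1.4 times the median, instead of about 136 times on the raw scale.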

verification_status

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 28.9 MB

Length

Mean 11.5856
Standard Deviation 2.9073
Median 12
Minimum 8
Maximum 15

Sample

1st row Not Verified
2nd row Not Verified
3rd row Source Verified
4th row Not Verified
5th row Verified

Letter

Count 4331796
Lowercase Letter 3679299
Space Separator 256467
Uppercase Letter 652497
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (Verified, Source Verified) take over 50.0%
  • The most frequent word (verified) occurs over 3.01 times as often as the second most frequent word (source)

issue_d

categorical

Approximate Distinct Count 115
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 27.6 MB

Length

Mean 8
Standard Deviation 0
Median 8
Minimum 8
Maximum 8

Sample

1st row Jan-2015
2nd row Jan-2015
3rd row Jan-2015
4th row Nov-2014
5th row Apr-2013

Letter

Count 1188090
Lowercase Letter 792060
Space Separator 0
Uppercase Letter 396030
Dash Punctuation 396030
Decimal Number 1584120
  • issue_d has words of constant length
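
issue_d is flagged as a high-cardinality categorical, but the samples are month-year strings, so parsing them as dates is usually more useful. A sketch with the standard library; the %b-%Y format is inferred from the samples, and month-name parsing via %b depends on an English locale:

```python
from datetime import datetime

def parse_month_year(raw: str) -> datetime:
    """Parse strings such as 'Jan-2015' into the first day of that month."""
    return datetime.strptime(raw, "%b-%Y")

samples = ["Jan-2015", "Nov-2014", "Apr-2013"]
parsed = [parse_month_year(s) for s in samples]
print([d.strftime("%Y-%m") for d in parsed])
# ['2015-01', '2014-11', '2013-04']
```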

loan_status

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 28.4 MB
  • The most frequent value (Fully Paid) occurs over 4.1 times as often as the second most frequent value (Charged Off)

Length

Mean 10.1961
Standard Deviation 0.3971
Median 10
Minimum 10
Maximum 11

Sample

1st row Fully Paid
2nd row Fully Paid
3rd row Fully Paid
4th row Fully Paid
5th row Charged Off

Letter

Count 3641943
Lowercase Letter 2849883
Space Separator 396030
Uppercase Letter 792060
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (Fully Paid, Charged Off) take over 50.0%
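
The class counts are not listed directly, but they can be recovered from the letter statistics: "Fully Paid" contains 7 lowercase letters and "Charged Off" contains 8, so the lowercase total fixes the split. A worked sketch, assuming these are the only two values (as the distinct count of 2 suggests):

```python
n_rows = 396_030
lowercase_total = 2_849_883

# 7 * fully_paid + 8 * charged_off = lowercase_total
# fully_paid + charged_off = n_rows
charged_off = lowercase_total - 7 * n_rows
fully_paid = n_rows - charged_off

print(fully_paid, charged_off)                 # 318357 77673
print(round(100 * charged_off / n_rows, 2))    # 19.61
print(round(10 + charged_off / n_rows, 4))     # 10.1961, the reported mean length
```

The resulting ratio of roughly 4.1 to 1 agrees with the imbalance noted above, and the charged-off share of about 19.6% is the figure that matters when choosing evaluation metrics for a classifier.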

purpose

categorical

Approximate Distinct Count 14
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 30.2 MB
  • The most frequent value (debt_consolidation) occurs over 2.82 times as often as the second most frequent value (credit_card)

Length

Mean 14.9978
Standard Deviation 4.2735
Median 18
Minimum 3
Maximum 18

Sample

1st row vacation
2nd row debt_consolidation
3rd row credit_card
4th row credit_card
5th row credit_card

Letter

Count 5583221
Lowercase Letter 5583221
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (debt_consolidation, credit_card) take over 50.0%

title

categorical

Approximate Distinct Count 48817
Approximate Unique (%) 12.4%
Missing 1755
Missing (%) 0.4%
Memory Size 30.9 MB
  • The most frequent value (Debt consolidation) occurs over 2.96 times as often as the second most frequent value (Credit card refinancing)

Length

Mean 17.2411
Standard Deviation 5.7392
Median 18
Minimum 2
Maximum 80

Sample

1st row Vacation
2nd row Debt consolidation
3rd row Credit card refina...
4th row Credit card refina...
5th row Credit Card Refina...

Letter

Count 6276985
Lowercase Letter 5749397
Space Separator 494561
Uppercase Letter 527588
Dash Punctuation 1929
Decimal Number 13723
  • The top 2 categories (Debt consolidation, Credit card refinancing) take over 50.0%
  • title contains many words: 14009 words
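
Part of the high cardinality in title may come from inconsistent capitalisation: the samples include both "Credit card refina..." and "Credit Card Refina...". A sketch of a normalisation step that collapses such variants; the example strings below are hypothetical stand-ins, since the report truncates the real ones:

```python
from collections import Counter

def normalise_title(raw: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(raw.casefold().split())

# Hypothetical raw values for illustration only.
raw_titles = [
    "Debt consolidation",
    "Debt Consolidation",
    "debt  consolidation ",
    "Vacation",
]

counts = Counter(normalise_title(t) for t in raw_titles)
print(counts.most_common())
# [('debt consolidation', 3), ('vacation', 1)]
```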

dti

numerical

Approximate Distinct Count 4262
Approximate Unique (%) 1.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 17.3795
Minimum 0
Maximum 9999
Zeros 313
Zeros (%) 0.1%
Negatives 0
Negatives (%) 0.0%
  • dti is skewed right (γ1 = 431.0496)

Quantile Statistics

Minimum 0
5-th Percentile 4.73
Q1 11.32
Median 16.93
Q3 23.04
95-th Percentile 31.68
Maximum 9999
Range 9999
IQR 11.72

Descriptive Statistics

Mean 17.3795
Standard Deviation 18.0191
Variance 324.6877
Sum 6.8828e+06
Skewness 431.0496
Kurtosis 237920.6726
Coefficient of Variation 1.0368
  • dti is not normally distributed (p-value 4.2265140739354965e-25)
  • dti has 268 outliers
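
The maximum of 9999 sits far from the 95th percentile of 31.68, and a handful of such values is enough to produce the extreme skewness and kurtosis reported here. The sketch below shows the effect on a small synthetic sample; the data are made up, and whether 9999 is a placeholder code or a real ratio cannot be determined from the report:

```python
from statistics import fmean, pstdev

def skewness(values):
    """Population skewness: mean of cubed z-scores."""
    mu = fmean(values)
    sigma = pstdev(values)
    return fmean([((v - mu) / sigma) ** 3 for v in values])

# Synthetic dti-like values (illustrative only), symmetric around 17.
base = [5.0, 11.0, 14.0, 17.0, 20.0, 23.0, 29.0] * 100
with_extreme = base + [9999.0]

print(skewness(base))               # approximately 0 for this symmetric sample
print(skewness(with_extreme) > 20)  # True: a single value dominates
```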

earliest_cr_line

categorical

Approximate Distinct Count 684
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Memory Size 27.6 MB

Length

Mean 8
Standard Deviation 0
Median 8
Minimum 8
Maximum 8

Sample

1st row Jun-1990
2nd row Jul-2004
3rd row Aug-2007
4th row Sep-2006
5th row Mar-1999

Letter

Count 1188090
Lowercase Letter 792060
Space Separator 0
Uppercase Letter 396030
Dash Punctuation 396030
Decimal Number 1584120
  • earliest_cr_line has words of constant length

open_acc

numerical

Approximate Distinct Count 61
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 11.3112
Minimum 0
Maximum 90
Zeros 6
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • open_acc is skewed right (γ1 = 1.213)

Quantile Statistics

Minimum 0
5-th Percentile 5
Q1 8
Median 10
Q3 14
95-th Percentile 21
Maximum 90
Range 90
IQR 6

Descriptive Statistics

Mean 11.3112
Standard Deviation 5.1376
Variance 26.3954
Sum 4.4796e+06
Skewness 1.213
Kurtosis 2.9669
Coefficient of Variation 0.4542
  • open_acc is not normally distributed (p-value 2.8778690689343605e-09)
  • open_acc has 10307 outliers

pub_rec

numerical

Approximate Distinct Count 20
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 0.1782
Minimum 0
Maximum 86
Zeros 338272
Zeros (%) 85.4%
Negatives 0
Negatives (%) 0.0%
  • pub_rec is skewed right (γ1 = 16.5765)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 1
Maximum 86
Range 86
IQR 0

Descriptive Statistics

Mean 0.1782
Standard Deviation 0.5307
Variance 0.2816
Sum 70569
Skewness 16.5765
Kurtosis 1867.4431
Coefficient of Variation 2.9781
  • pub_rec is not normally distributed (p-value 4.303050012884538e-25)
  • pub_rec has 57758 outliers
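
Because Q1 and Q3 are both 0, the IQR is 0 and any IQR-based fence collapses to the single value 0, so every non-zero entry falls outside it. The reported outlier count matches that reading exactly, as this check shows (the IQR rule itself is an assumption about how the report defines outliers):

```python
n_rows = 396_030
zeros = 338_272
missing = 0

q1, q3 = 0, 0
iqr = q3 - q1
# With iqr == 0 the fences are [q1, q3] = [0, 0] for any multiplier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

non_zero = n_rows - missing - zeros
print(lower, upper)  # 0.0 0.0
print(non_zero)      # 57758, equal to the reported number of outliers
```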

revol_bal

numerical

Approximate Distinct Count 55622
Approximate Unique (%) 14.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 15844.5399
Minimum 0
Maximum 1.7433e+06
Zeros 2128
Zeros (%) 0.5%
Negatives 0
Negatives (%) 0.0%
  • revol_bal is skewed right (γ1 = 11.7275)

Quantile Statistics

Minimum 0
5-th Percentile 1716
Q1 6060
Median 11204
Q3 19663.25
95-th Percentile 41237.3
Maximum 1.7433e+06
Range 1.7433e+06
IQR 13603.25

Descriptive Statistics

Mean 15844.5399
Standard Deviation 20591.8361
Variance 4.2402e+08
Sum 6.2749e+09
Skewness 11.7275
Kurtosis 384.2162
Coefficient of Variation 1.2996
  • revol_bal is not normally distributed (p-value 5.415279189419741e-25)
  • revol_bal has 21181 outliers

revol_util

numerical

Approximate Distinct Count 1226
Approximate Unique (%) 0.3%
Missing 276
Missing (%) 0.1%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 53.7917
Minimum 0
Maximum 892.3
Zeros 2213
Zeros (%) 0.6%
Negatives 0
Negatives (%) 0.0%
  • revol_util is skewed left (γ1 = -0.0718)

Quantile Statistics

Minimum 0
5-th Percentile 11.5
Q1 36
Median 54.9
Q3 73
95-th Percentile 92.1
Maximum 892.3
Range 892.3
IQR 37

Descriptive Statistics

Mean 53.7917
Standard Deviation 24.4522
Variance 597.9097
Sum 2.1288e+07
Skewness -0.07178
Kurtosis 2.7122
Coefficient of Variation 0.4546
  • revol_util is not normally distributed (p-value 8.746266244366458e-12)
  • revol_util has 12 outliers

total_acc

numerical

Approximate Distinct Count 118
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 25.4147
Minimum 2
Maximum 151
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • total_acc is skewed right (γ1 = 0.8643)

Quantile Statistics

Minimum 2
5-th Percentile 9
Q1 17
Median 24
Q3 32
95-th Percentile 47
Maximum 151
Range 149
IQR 15

Descriptive Statistics

Mean 25.4147
Standard Deviation 11.887
Variance 141.3005
Sum 1.0065e+07
Skewness 0.8643
Kurtosis 1.2046
Coefficient of Variation 0.4677
  • total_acc is not normally distributed (p-value 0.00012141157673806023)
  • total_acc has 8499 outliers

initial_list_status

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 24.9 MB
  • The most frequent value (f) occurs over 1.51 times as often as the second most frequent value (w)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row w
2nd row f
3rd row f
4th row f
5th row f

Letter

Count 396030
Lowercase Letter 396030
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (f, w) take over 50.0%
  • initial_list_status has words of constant length

application_type

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 28.3 MB
  • The most frequent value (INDIVIDUAL) occurs over 930.16 times as often as the second most frequent value (JOINT)

Length

Mean 9.9946
Standard Deviation 0.1637
Median 10
Minimum 5
Maximum 10

Sample

1st row INDIVIDUAL
2nd row INDIVIDUAL
3rd row INDIVIDUAL
4th row INDIVIDUAL
5th row INDIVIDUAL

Letter

Count 3957889
Lowercase Letter 0
Space Separator 0
Uppercase Letter 3957889
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (INDIVIDUAL, JOINT) take over 50.0%
  • The most frequent word (individual) occurs over 930.16 times as often as the second most frequent word (joint)

mort_acc

numerical

Approximate Distinct Count 33
Approximate Unique (%) 0.0%
Missing 37795
Missing (%) 9.5%
Infinite 0
Infinite (%) 0.0%
Memory Size 5.5 MB
Mean 1.814
Minimum 0
Maximum 34
Zeros 139777
Zeros (%) 35.3%
Negatives 0
Negatives (%) 0.0%
  • mort_acc is skewed right (γ1 = 1.6001)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 1
Q3 3
95-th Percentile 6
Maximum 34
Range 34
IQR 3

Descriptive Statistics

Mean 1.814
Standard Deviation 2.1479
Variance 4.6136
Sum 649835
Skewness 1.6001
Kurtosis 4.4771
Coefficient of Variation 1.1841
  • mort_acc is not normally distributed (p-value 3.0438108581877407e-18)
  • mort_acc has 6843 outliers
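
mort_acc has the most missing values of any column (9.54%), so it needs an explicit strategy before modelling. One simple option is median imputation, sketched here on a toy list; the choice of the median (1 in this dataset) is an assumption, not something the report recommends:

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Toy data for illustration only.
toy = [0, 0, 1, None, 3, 6, None]
print(impute_median(toy))
# [0, 0, 1, 1, 3, 6, 1]
```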

pub_rec_bankruptcies

numerical

Approximate Distinct Count 9
Approximate Unique (%) 0.0%
Missing 535
Missing (%) 0.1%
Infinite 0
Infinite (%) 0.0%
Memory Size 6.0 MB
Mean 0.1216
Minimum 0
Maximum 8
Zeros 350380
Zeros (%) 88.5%
Negatives 0
Negatives (%) 0.0%
  • pub_rec_bankruptcies is skewed right (γ1 = 3.4234)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 1
Maximum 8
Range 8
IQR 0

Descriptive Statistics

Mean 0.1216
Standard Deviation 0.3562
Variance 0.1269
Sum 48111
Skewness 3.4234
Kurtosis 18.1039
Coefficient of Variation 2.9279
  • pub_rec_bankruptcies is not normally distributed (p-value 9.141517016810044e-25)
  • pub_rec_bankruptcies has 45115 outliers

address

categorical

Approximate Distinct Count 393700
Approximate Unique (%) 99.4%
Missing 0
Missing (%) 0.0%
Memory Size 41.4 MB

Length

Mean 44.714
Standard Deviation 7.7432
Median 46
Minimum 20
Maximum 69

Sample

1st row 0174 Michelle Gate...
2nd row 1076 Carney Fort A...
3rd row 87025 Mark Dale Ap...
4th row 823 Reid Ford Del...
5th row 679 Luna Roads Gr...

Letter

Count 10179354
Lowercase Letter 7690715
Space Separator 2128626
Uppercase Letter 2488639
Dash Punctuation 0
Decimal Number 4151920
  • address contains many words: 291208 words

Interactions

Correlations

Missing Values